Ranking XPaths for extracting search result records

Authors

Dolf Trieschnigg

Kien Tjin-Kam-Jet

Djoerd Hiemstra

Abstract

Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine that combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on each new search result page. In this paper we describe an algorithm to automatically determine XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that for 85% of the tested search result pages, a useful XPath is determined. The algorithm is implemented as a browser plugin and as a standalone application, both of which are available as open source software.
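
The portability claim in the abstract (one XPath, determined once, reused on every page built from the same template) can be illustrated with a minimal sketch. This is not the authors' algorithm: the XPath below is written by hand rather than determined or ranked automatically, and the page markup, the `SRR_XPATH` expression and the `extract_srrs` helper are all hypothetical. The pages are well-formed XHTML so that Python's standard-library parser can read them; real result pages would need an HTML parser.

```python
import xml.etree.ElementTree as ET

# Two hypothetical result pages that share one template (assumed markup).
PAGE_A = """
<html><body>
  <div id="nav"><a href="/about">About</a></div>
  <ul class="results">
    <li class="result"><a href="http://example.org/1">First hit</a><p>Snippet one</p></li>
    <li class="result"><a href="http://example.org/2">Second hit</a><p>Snippet two</p></li>
  </ul>
</body></html>
"""

PAGE_B = """
<html><body>
  <div id="nav"><a href="/about">About</a></div>
  <ul class="results">
    <li class="result"><a href="http://example.org/9">Other hit</a><p>Other snippet</p></li>
  </ul>
</body></html>
"""

# Hand-written stand-in for the XPath an automatic method would determine
# from a single page. It selects one node per search result record.
SRR_XPATH = ".//ul[@class='results']/li[@class='result']"


def extract_srrs(page_source, xpath=SRR_XPATH):
    """Return one dict per node matched by the reusable XPath."""
    root = ET.fromstring(page_source.strip())
    records = []
    for node in root.findall(xpath):
        link = node.find("a")      # assumed position of title and URL
        snippet = node.find("p")   # assumed position of the snippet text
        records.append({
            "title": link.text if link is not None else None,
            "url": link.get("href") if link is not None else None,
            "snippet": snippet.text if snippet is not None else None,
        })
    return records


if __name__ == "__main__":
    # The same expression is applied unchanged to both pages.
    for page in (PAGE_A, PAGE_B):
        print(extract_srrs(page))
```

The navigation link in each page is not matched, because the expression targets only the repeated result nodes. That is the property that makes a single expression reusable across pages of the same template.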

Similar resources

Sample-based XPath Ranking for Web Information Extraction

Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive as some applications require information extraction from previously unseen websites. This paper targets auto...

An Ensemble Click Model for Web Document Ranking

Annually, web search engine providers spend more and more money on documents ranking in search engines result pages (SERP). Click models provide advantageous information for ranking documents in SERPs through modeling interactions among users and search engines. Here, three modules are employed to create a hybrid click model; the first module is a PGM-based click model, the second module in a d...

A Scalable Image Snippet Extraction Framework for Integration with Search Engines

Search result visualization is a task performed by search engines that enables users to find their desired documents, in an effective and efficient manner. Image based summary or best images of a web document, displayed as a part of the visualization process, has become indispensable, as a human perceives images instantaneously. But, selection of the best image increases latency in search resul...

Accelerating E-Commerce Search Engine Ranking by Contextual Factor Selection

In industrial large-scale search systems, such as Taobao.com search for commodities, the quality of the ranking result is getting continually improved by introducing more factors from complex procedures, e.g., deep neural networks for extracting image factors. Meanwhile, the increasing of the factors demands more computation resource and raises the system response latency. It has been observed ...

Incremental Web Search: Tracking Changes in the Web

A large amount of new information is posted on the Web every day. Large-scale web search engines often update their index slowly and are unable to present such information in a timely manner. In this thesis, we present our solutions of searching new information from the web by tracking the changes of web documents. First, we present the algorithms and techniques useful for solving the following...

Publication date: 2012
